1 Learning outcomes for today

What you will learn in this practical:

  • How to upload a dataset into R
  • How to explore a new dataset
  • How to conduct the five data analysis steps to answer a scientific question
  • How to write hypotheses
  • How to conduct and interpret Mann-Whitney U tests and t-tests
  • How to write conclusions based on evidence

2 Practical 2 learning outcomes

In last week’s practical you learned:

  • How to do more advanced plotting in ggplot2, including how to colour plots by a variable; add a linear trend line to a scatter plot: +geom_smooth(); and add jittered points to a boxplot: + geom_jitter().

  • How to identify whether a numerical variable is normally distributed or not.

  • How to conduct a Shapiro Wilk test: shapiro.test(data$column).

  • How to interpret the output of a Shapiro Wilk test.

3 Preliminary tasks

1. If you have not completed practicals 1 and 2, you need to do these first.

2. If you are ready to start Practical 3, DOWNLOAD the third practical worksheet from Blackboard, do not just view it. To download it, click the “…” next to the file and click ‘Download original file’. Save this file in an appropriately named folder on your OneDrive (e.g., “MODULE NAME/WEEK X”) so you can access it at home.

3. Navigate on your web browser to <https://login.rstudio.cloud/>. Assuming you have already registered for practical 1 - 3, then sign in. If prompted, click “Posit Cloud”.

4. You should already have an existing R studio project from the first practical called ‘R workshop’, click on this to enter your saved workspace.

5. Start a new R script for this practical. Once your project has loaded (it takes a few seconds), click ‘File’ then ‘New file’, then ‘R Script’.

6. Press the Save button and save your script as “Practical_3.R”. This will save all your code you write today.


4 Recap: How to run code in R

R considers each *numbered line* as a stand-alone line of code. To run a line of code in R, put your mouse cursor on the line you want to run and press **Control and Enter** (this is the most common method).

Alternatively, press the ‘Run’ button above the code editor.

Anything after a # key is NOT code, it is comments/notes. R will ignore any writing after the # key. You do not need to copy out the comments into R.

5 Recap: Plotting in ggplot2

In practicals 1 and 2, we focused mainly on learning how to visualise data using the R package ggplot2. You need to be familiar with the following code:

NOTE: There are a lot more things you can do with ggplot2 that we haven’t had time to explore. More examples of all the kinds of plots you can make with ggplot2, and the code to make them, can be found at https://r-graph-gallery.com/. If you click on the plot you want, it will provide the code to make it.


6 Uploading a new data set to R

We will now analyse a new biological data set that is based on data collected from tigers in zoos.


This dataset includes data from a study that looked at the effect on antibiotic treatment and diet treatment on captive tiger weight and parasite load across zoos in the United States. Parasite load was measured as the number of parasite eggs found in tiger faecal samples.

The scientists also collected data on tiger age and sex.

The new data set (tiger_data.csv) is on Blackboard alongside the practical 3 tutorial. Instructions for how to load it into your R studio cloud project are below. Note, pay attention as you will need to do this for your assignment data! If you are using R studio software, you can skip steps 2 and 3.

  1. Download the data file from Blackboard to your OneDrive folder for this session.

  2. From within R studio cloud, in your bottom right window navigate to the File tab and click ‘Upload’



  1. Under ‘file to upload’, navigate to the folder where you saved the data set



You should now see your data set tiger_data.csv in your working directory:



  1. Next you need to load the new dataset into your working environment, and call it ‘tiger_data’. You can do this by pressing the ‘Import dataset’ button above Environment space, then ’From Text (base).


  1. Open your tiger_data.csv file, make sure it is being imported as ‘tiger_data’ (top left box in pop-up window), and press ‘Import’:

You should now have have your data object tiger_data in your working environment.


  1. Copy the import code from your console into your R script, so that in the future you can just run one line of code to import the dataset again, eg:

tiger_data <- read.csv(“OneDrive/Research Skills/Week 7/tiger_data.csv”)

The pathway above should be where you saved your data on your OneDrive.


7 Exploring the new tiger dataset

First lets have a look at what this dataset looks like by using the head() and View() function.

head(tiger_data) # Look at the first six rows in the terminal
View(tiger_data) # examine the dataset. Take note of the column names

dim(tiger_data) # Examine how many rows (obervations) and columns (variables) there are. 
## [1] 110   7
names(tiger_data) # What are the column/variable names?
## [1] "tiger_ID"             "weight_kg"            "parasite_load"       
## [4] "antibiotic_treatment" "diet_treatment"       "age_years"           
## [7] "sex"

Now lets explore the dataset and apply some of the things you’ve learned in previous practicals.

7.1 Exercise 1

Check if your variables (ie column names) have been automatically assigned the correct data type by R (catagorical/continuous/count). Write out how each variable is treated by R, and the correct data type of each variable.

Remember, categorical data in R is called “factor” or “character”, count data is called “integer”, and continous data is called “numeric”.

Hint: Use the str() function to examine data types.

Hint 2: tigers are individuals, not numbers.

str(tiger_data)
## 'data.frame':    110 obs. of  7 variables:
##  $ tiger_ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weight_kg           : num  116.1 103 119.4 124.3 99.3 ...
##  $ parasite_load       : int  40 18 14 52 13 34 7 30 7 14 ...
##  $ antibiotic_treatment: chr  "With Antibiotics" "With Antibiotics" "With Antibiotics" "With Antibiotics" ...
##  $ diet_treatment      : chr  "High Fat" "High Protein" "High Fat" "High Protein" ...
##  $ age_years           : int  2 5 8 5 2 8 6 2 3 9 ...
##  $ sex                 : chr  "Female" "Female" "Male" "Female" ...

ANSWERS

  • What data types each variable is treated as in R:

Tiger ID = Count

Weight = Continous

Antibiotic treatment = Catagorical

Diet treatment = Categorical

Parasite load = Count

Age = Count

Sex = Categorical


  • What data types each variable should be:

Tiger ID = Categorical

Weight = Continuous

Antibiotic treatment = Categorical

Diet treatment = Categorical

Parasite load = Count

Age = Continuous/Count

Sex = Categorical


Only tiger_ID is incorrectly categorised by R. You therefore need to correct that.

7.2 Exercise 2

Use the function as.factor() to change tiger ID into categorical data, so that R does not treat is as numerical or count data.

tiger_data$tiger_ID <- as.factor(tiger_data$tiger_ID)

# check it has changed

str(tiger_data)
## 'data.frame':    110 obs. of  7 variables:
##  $ tiger_ID            : Factor w/ 110 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weight_kg           : num  116.1 103 119.4 124.3 99.3 ...
##  $ parasite_load       : int  40 18 14 52 13 34 7 30 7 14 ...
##  $ antibiotic_treatment: chr  "With Antibiotics" "With Antibiotics" "With Antibiotics" "With Antibiotics" ...
##  $ diet_treatment      : chr  "High Fat" "High Protein" "High Fat" "High Protein" ...
##  $ age_years           : int  2 5 8 5 2 8 6 2 3 9 ...
##  $ sex                 : chr  "Female" "Female" "Male" "Female" ...

7.3 Exercise 3

In practicals 1 and 2 you learned how to summarise categorical and numerical data using summary() and table(). Use these functions to answer the following questions:

  1. How many tigers were given antibiotics, and how many tigers were not given antibiotics?

    table(tiger_data$antibiotic_treatment)
    ## 
    ##   No Antibiotics With Antibiotics 
    ##               55               55
    # Answer: antibiotics = 55, no antibiotics = 55
  2. Given the above, what is your total sample size?

    # Answer: 110
  3. How many tigers were given the high fat diet, and how many were given the high protein diet?

    table(tiger_data$diet_treatment)
    ## 
    ##     High Fat High Protein 
    ##           48           62
    # Answer: High fat = 48, High protein = 62
  4. What is the mean parasite load across tigers?

    summary(tiger_data$parasite_load)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##     0.0    14.0    26.5    31.3    43.0   100.0
    # Answer = 31.3 parasite eggs
  5. What is the mean weight across tigers?

summary(tiger_data$weight_kg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    90.7   104.3   113.9   114.0   122.1   142.5
# Answer = 114kg

8 Introduction to statistics

We will now begin to statistically analyse this new dataset. Statistics is the process of fitting statistical models to you data in order to generate statistical evidence for or against your hypotheses. Unfortunately, there is no one model that fits all data. The model needs to be tailored to the kind of data you have.

The type of statistical model you fit depends on three things:

  1. The type of data. This can be visualized by what kind of plot you can generate.

  2. Whether your numerical variable of interest is normally distributed or not.

  3. Whether your data points are independent from each other or not. For now, though, we will ignore this point and assume all our data points are independent.

Below is a summary of the names of models you can use for what kind of data, and the R function needed to perform the test.

Over the next practicals, we will focus on learning how to fit these models to our data, how to interpret the outputs, and how to write out our conclusions.

In addition, the code for these statistical tests is also on your cheatsheet handout:


9 The five data analysis steps

For every analysis you do, you need to go through five steps:

  1. Write out your null and alternative hypotheses using IF/THEN statements.

  2. Plot the relationship between your variables of interest.

  3. Check if your numerical data has a normal (bell shaped) distribution.

  4. Conduct the appropriate statistical test.

  5. Write out your results using evidence generated from the statistical model.

We will now go through these steps to choose what model we need when presented with a scientific question.

10 Question 1: Does antibiotic treatment affect parasite load?

For this question, I will guide you through the 5 analysis steps. For the next question, I will expect you to complete the steps yourself. You will also need to go through these steps for your assignment.

Background reasoning: Antibiotics are used to kill bacteria, and also sometimes kill parasites too. Therefore, they are sometimes used as an anti-parasitic treatment. You therefore think that antibiotic treatment should be linked to parasite load in tigers.

10.1 Step 1: Hypotheses

Write down our null and alternative hypotheses. Remember, hypotheses should provide a biological explanation of your predictions:

NULL: IF [explanation or mechanism not true], THEN [what pattern you expect in your dataset]

ALTERNATIVE: IF [explanation or mechanisms is true], THEN [what pattern you expect in your dataset]

Lets have a go:

NULL: If antibiotics do not kill parasites, then there will be no difference in parasite load between tigers that were treated with antibiotics and those that weren’t.

ALTERNATIVE: If antibiotics kills parasites, then tigers that were treated with antibiotics will have fewer parasites, on average, than those that no antibiotic treatment.

10.2 Step 2: Plot relationship

Step 2 is to plot the relationship you are interested in testing. You need to work out what kind of plot you need using the following guide:

10.3 Exercise 4

Plot the relationship between antibiotic treatment and parasite load.

Remember, you always need to load the ggplot2 package before plotting using library(), so don’t forget this line of code.

In addition, the variable that you are interested in predicting, in this case parasite load, should be on the y axis.

library(ggplot2) # load the ggplot2 package before plotting


# boxplot of antibiotic treatment against parasite load

ggplot(tiger_data, aes(x = antibiotic_treatment, y = parasite_load))+
  geom_boxplot()

We can see from our plot that tigers that were treated with antibiotics have lower parasite load than those that were not treated with antibiotics.

Therefore, there is already some indication that the evidence supports our alternative hypothesis. BUT, we need to quantify this difference with a statistical model.

10.4 Step 3: Test for normality

Step 3 involves checking if our numerical variable of interest has a normal distribution.

Our variable of interest is parasite load, which is numeric data. We need to check if it is normally distributed to know what statistical model we can use.

We can do this by:

1) Plotting a histogram of our numerical data; and

2) Performing the Shapiro-Wilk test of normality on the weight data.

You did this in last week’s practical so you should be familiar with this step already.

10.5 Exercise 5

Make a histogram of parasite load and conduct a Shapiro-Wilk test. Is parasite load normally distributed?

# plot a histogram of parasite load

ggplot(tiger_data, aes(x = parasite_load))+
  geom_histogram(col = "white")

# This does NOT look normally distributed. It looks right skewed (tail on the right)

# Conduct shapiro-wilk test

shapiro.test(tiger_data$parasite_load)
## 
##  Shapiro-Wilk normality test
## 
## data:  tiger_data$parasite_load
## W = 0.92318, p-value = 8.65e-06

You should get the following output from the shapiro-wilk test:

The p-val is 8.65e_06, which is equal to 0.00000865 (move the decimal point 6 places to the left!), which is smaller 0.05. This means the parasite load is NOT normally distributed.

Use this table to help you interpret the results of the Shapiro-Wilk test:

p-val Conclusion
< 0.05 Data is NOT normally distributed
> 0.05 Data IS normally distributed


10.6 Step 4: Statistical test

Step 4 involves using the framework above to identify the correct statistical test for the data.

The parasite data is not normally distributed, and our boxplot compares two categories (Antibiotics and no antibiotics). We therefore need a Man–Whitney Wilcoxon U test (sometimes referred to as just “Mann-Whitney U test”, or “Wilcoxon Rank Sum test”).

Run the following code:

# conduct Mann-Whitney-Wilcoxon U test

wilcox.test(parasite_load ~ antibiotic_treatment, data = tiger_data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  parasite_load by antibiotic_treatment
## W = 1970, p-value = 0.006279
## alternative hypothesis: true location shift is not equal to 0
# that's it! You've fitted a statistical model. 

You should get the following output with the effect size and the p value.

The effect size (W) is 1970. Because this is a non-parametric test, you can’t really interpret this number logically.

The P value is 0.006.

The p value is under 0.05, which means there is a SIGNIFICANT statistical association between antibiotic treatment and parasite load, so we reject our null hypothesis.

10.7 Step 5: Write out results

To summarise your results, you need to include the direction of the effect, and information on your sample size, effect size, and p value. Should also include a nice figure to visualise the relationship you are testing.

Note: in your results section you do not need to refer to your hypotheses. This is how you would write your results in an assignment or scientific manuscript:

From our sample of 110 captive tigers, we found evidence that antibiotic treatment significantly reduced parasite load (Mann-Whitney U test, W = 1970, p = 0.006; Fig. 1).

Figure 1) Effect of antibiotic treatment on parasite load in 110 captive tigers.

# Note, this is the code I used to genera the plot above

ggplot(tiger_data, aes(x = antibiotic_treatment, y = parasite_load))+ 
  geom_boxplot(fill = "pink")+ 
  geom_jitter(width = 0.1, alpha = 0.4, col = "red")+ 
  theme_classic(base_size = 12)+ 
  ylab("Parasite count")+ 
  xlab("Antibiotic treatment")

10.8 Q2: Do antibiotics affect tiger weight?

We have just tested whether antibiotics are effective at treating parasites, but do they affect tiger weight? You think that antibiotics might increase energy availability by removing parasites - which are energetically expensive for hosts - thereby allowing tigers to harvest more energy for weight gain.

Now it is your turn to go through the 5 analysis steps to answer this second question investigating the possible effect of antibiotics on tiger weight.

10.9 Exercise 6

Step 1: Write out your null and alternative hypotheses, making sure you use if/then statements.

NULL HYPOTHESIS: IF antibiotics do not increase energy availability by removing parasites, THEN I expect no difference in weight between tigers given antibiotics and those that aren’t.

ALTERNATIVE HYPOTHESIS: IF antibiotics increase energy availability by removing parasites, THEN I expect tigers given antibiotics to be heavier than those not given antibiotics.

10.10 Exercise 7

Step 2: Plot the relationship between weight and antibiotic use.

ggplot(tiger_data, aes(x = antibiotic_treatment, y = weight_kg))+
  geom_boxplot()

# Does not look like there is any difference in weight between treatments. 

10.11 Exercise 8

Step 3: Test whether your numeric variable of interest (in this case, weight) is normally distributed. Remember, you need to generate a histogram AND do a Shapiro Wilk test.

ggplot(tiger_data, aes( x= weight_kg))+geom_histogram(col = "white")

shapiro.test(tiger_data$weight_kg)
## 
##  Shapiro-Wilk normality test
## 
## data:  tiger_data$weight_kg
## W = 0.98521, p-value = 0.2667

ANSWER: According to a histogram and the results of a Shapiro-Wilk test (p = 0.27), weight IS normally distributed.

10.12 Exercise 9

Step 4: Identify and conduct the appropriate statistical test using the framework for identifying the appropriate model.

ANSWER: The correct test is a Two sample t-test.

# conduct test

t.test(weight_kg~antibiotic_treatment, data = tiger_data)
## 
##  Welch Two Sample t-test
## 
## data:  weight_kg by antibiotic_treatment
## t = -0.014697, df = 107.99, p-value = 0.9883
## alternative hypothesis: true difference in means between group No Antibiotics and group With Antibiotics is not equal to 0
## 95 percent confidence interval:
##  -4.446526  4.381071
## sample estimates:
##   mean in group No Antibiotics mean in group With Antibiotics 
##                       114.0255                       114.0582

10.13 Exercise 10

Step 5: Write our your conclusions based on your plot and model below.

ANSWER:

“I found no evidence from my sample of 110 tigers that antibiotics increases tiger weight (t-test, t = -0.01, p = 0.99)”.

OR

“I found no evidence from my sample of 110 tigers that antibiotics affects tiger weight (t-test, t = -0.01, p = 0.99)”.

OR

“I found no evidence that antibiotics affects tiger weight (t-test, t = -0.01, p = 0.99, n = 110), suggesting that parasite removal by antibiotics does not influence energy availability”.

11 Optional: Advanced models

For those who want to go a step further, below I provide an example of a more sophisticated statistical analysis of the data. However, you do NOT need this for your assignment. You MAY need this for your final year project next year. Therefore I am including it here just in case.

General Linear Models

In this practical and the next practicals we have just focused on UNIVARITE analyses. This means we look at the effect of one variable on one other variable. However, in reality, lots of things are likely to affect your variable of interest.

Ideally you want to include all those variables in your model. You can do that, and the most common model to do this is called the General Linear Model.

Luckily, they are extremely easy to run. Lets look at the independent effects on antibiotic treatment, diet treatment, age and sex on parasite load.

# run model
glm_model<-lm(parasite_load ~ antibiotic_treatment + diet_treatment + age_years + sex, data = tiger_data)

# look at output
summary(glm_model)
## 
## Call:
## lm(formula = parasite_load ~ antibiotic_treatment + diet_treatment + 
##     age_years + sex, data = tiger_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.729 -14.762  -0.782  11.082  66.137 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           44.0592     6.7072   6.569 1.99e-09 ***
## antibiotic_treatmentWith Antibiotics -12.7669     4.1152  -3.102  0.00247 ** 
## diet_treatmentHigh Protein            -3.9135     4.1646  -0.940  0.34953    
## age_years                             -0.1650     0.8817  -0.187  0.85193    
## sexMale                               -5.6231     4.2476  -1.324  0.18843    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.95 on 105 degrees of freedom
## Multiple R-squared:  0.08922,    Adjusted R-squared:  0.05452 
## F-statistic: 2.571 on 4 and 105 DF,  p-value: 0.04206

You should get the following output:

This output tells you:

  • That antibiotic treatment significantly affects parasite load (p < 0.05), but diet treatment, age, and sex and no significant effect (p > 0.05).

  • On average, tigers that were given antibiotics had 12.8 fewer parasite eggs than those that didn’t not get antibiotics.

  • Overall, the model explains 5.5% of variation in parasite load (this is called the Adjected R2).

GLMs have a bunch of assumptions that you need to know about if you are going to use them on your own data, but this serves as an introduction on how to use them.

11.1 Optional Exercise 11

Run a GLM to look at the effect of antibiotics, diet, age and sex on tiger weight.

model_weight <- lm(weight_kg~antibiotic_treatment + diet_treatment + age_years + sex, data = tiger_data)


summary(model_weight)
## 
## Call:
## lm(formula = weight_kg ~ antibiotic_treatment + diet_treatment + 
##     age_years + sex, data = tiger_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.5009  -7.6231  -0.1184   7.2668  30.9649 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          111.9145     3.5643  31.399   <2e-16 ***
## antibiotic_treatmentWith Antibiotics   0.3801     2.1868   0.174   0.8624    
## diet_treatmentHigh Protein            -5.6044     2.2131  -2.532   0.0128 *  
## age_years                              0.4802     0.4685   1.025   0.3078    
## sexMale                                4.2646     2.2572   1.889   0.0616 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.13 on 105 degrees of freedom
## Multiple R-squared:  0.1167, Adjusted R-squared:  0.08302 
## F-statistic: 3.467 on 4 and 105 DF,  p-value: 0.01058

This model tells you that:

  • Tigers fed high protein diets are 5.k kg lighter than those fed high fat diets (t = -2.5, p = 0.013)

  • Male tigers are marginally more likely to be 4.3kg heavier than females (but not quite significant; t = 1.9, p = 0.06)

  • Antibiotics (t = 0.17, p = 0.86) and age (t = 1.025, p = 0.3) do not significantly affect tiger weight

  • Overall these variables explain 8% of all variation in tiger weight (R2 = 0.083).